Semi-supervised Graph-based Genre Classification for Web Pages
Abstract
It is still unclear which set of features produces the best results in automatic genre classification on the web. Therefore, in a first set of experiments, we compared a wide range of content-based features, which are extracted from the data appearing within the web pages themselves. The results show that lexical features such as word unigrams and character n-grams have more discriminative power in genre classification than features such as part-of-speech n-grams and text statistics. In a second set of experiments, with the aim of learning from neighbouring web pages, we investigated the performance of a semi-supervised graph-based model, which is a novel technique in genre classification. The results show that our semi-supervised min-cut algorithm improves the overall genre classification accuracy. However, some genre classes appear to benefit more from this graph-based model than others.
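The abstract does not spell out how the graph is built or how the cut is computed, so the following is only a minimal sketch of the generic s-t min-cut idea behind such models, not the authors' implementation. It assumes a binary, one-vs-rest setting for a single genre, hypothetical edge weights between linked pages, and a plain Edmonds-Karp max-flow; the function name `mincut_labels` and the page identifiers are illustrative only.

```python
from collections import defaultdict, deque


def mincut_labels(edges, positive, negative):
    """Return the nodes on the positive side of an s-t minimum cut.

    edges: iterable of (u, v, weight) links between pages (assumed undirected).
    positive / negative: disjoint sets of labelled pages for one genre.
    Unlabelled pages take the label of the side of the cut they fall on.
    """
    if set(positive) & set(negative):
        raise ValueError("a page cannot be both positive and negative")

    source, sink = "_source", "_sink"
    inf = float("inf")
    cap = defaultdict(lambda: defaultdict(float))  # residual capacities

    for u, v, w in edges:       # an undirected link gets capacity both ways
        cap[u][v] += w
        cap[v][u] += w
    for p in positive:          # labelled pages are tied to their terminal
        cap[source][p] = inf
    for n in negative:
        cap[n][sink] = inf

    def find_path():
        """Breadth-first search for an augmenting path in the residual graph."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    while True:
        parent = find_path()
        if parent is None:
            break
        # Bottleneck capacity along the path found.
        flow, v = inf, sink
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        # Push the flow and update the residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Nodes still reachable from the source form the positive side of the cut.
    reachable, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 0 and v not in reachable:
                reachable.add(v)
                queue.append(v)
    return reachable - {source}
```

For a chain of four hypothetical pages with edges `("a", "b", 3.0)`, `("b", "c", 1.0)`, `("c", "d", 3.0)`, where `a` is labelled positive and `d` negative, the cheapest cut removes the weak `b`-`c` link, so the unlabelled page `b` is assigned to the positive side and `c` to the negative side.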
Similar resources
A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion
This paper describes a machine learning approach for detecting web spam. Each example in this classification task corresponds to 100 web pages from a host and the task is to predict whether this collection of pages represents spam or not. This task is part of the 2007 ECML/PKDD Graph Labeling Workshop’s Web Spam Challenge (track 2). Our approach begins by adding several human-engineered feature...
Graph Labelling Workshop and Web Spam Challenge
We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semi-supervised learning assumes that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenomenon both for Web spam and telco churn. We conclude that spam is often linked to spam while honest pages are...
Identifying Genres of Web Pages
In this paper, we present an inferential model for text type and genre identification of web pages, where text types are inferred using a modified form of Bayes’ theorem, and genres are derived using a few simple if-then rules. As the genre system on the web is a complex reality, and web pages are much more unpredictable and individualized than paper documents, we propose this approach as an al...
Semi-Supervised Learning: A Comparative Study for Web Spam and Telephone User Churn
We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semi-supervised learning assumes that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenomenon both for Web spam and telco churn. We conclude that spam is often linked to spam while honest pages are...
Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction
Web page classification has attracted increasing research interest. It is intrinsically a multi-view and semi-supervised application, since web pages usually contain two or more types of data, such as text, hyperlinks and images, and unlabeled pages generally far outnumber labeled ones. Web page data is commonly high-dimensional. Thus, how to extract useful features from this kind of data ...